10 research outputs found

    Influences on Throughput and Latency in Stream Programs

    Get PDF
    Vu Thien Nga Nguyen and Raimund Kirner, 'Influences on Throughput and Latency in Stream Programs' paper presented at the 2nd Workshop on Feedback-Directed Compiler Optimization for Multi-Core Architectures. Berlin, Germany. 22 January 2013Stream programming is a promising approach to execute programs on parallel hardware such as multi-core systems. It allows to reuse sequential code at component level and to extend such code with concurrency-handling at the communication level. In this paper we investigate in the performance of stream programs in terms of throughput and latency. We identify factors that affect these performance metrics and propose an efficient scheduling approach to obtain the maximal performance

    An Efficient Execution Model for Reactive Stream Programs

    Get PDF
    Stream programming is a paradigm where a program is structured by a set of computational nodes connected by streams. Focusing on data moving between computational nodes via streams, this programming model fits well for applications that process long sequences of data. We call such applications reactive stream programs (RSPs) to distinguish them from stream programs with rather small and finite input data. In stream programming, concurrency is expressed implicitly via communication streams. This helps to reduce the complexity of parallel programming. For this reason, stream programming has gained popularity as a programming model for parallel platforms. However, it is also challenging to analyse and improve the performance without an understanding of the program's internal behaviour. This thesis targets an effi cient execution model for deploying RSPs on parallel platforms. This execution model includes a monitoring framework to understand the internal behaviour of RSPs, scheduling strategies for RSPs on uniform shared-memory platforms; and mapping techniques for deploying RSPs on heterogeneous distributed platforms. The foundation of the execution model is based on a study of the performance of RSPs in terms of throughput and latency. This study includes quantitative formulae for throughput and latency; and the identification of factors that influence these performance metrics. Based on the study of RSP performance, this thesis exploits characteristics of RSPs to derive effective scheduling strategies on uniform shared-memory platforms. Aiming to optimise both throughput and latency, these scheduling strategies are implemented in two heuristic-based schedulers. Both of them are designed to be centralised to provide load balancing for RSPs with dynamic behaviour as well as dynamic structures. The first one uses the notion of positive and negative data demands on each stream to determine the scheduling priorities. This scheduler is independent from the runtime system. The second one requires the runtime system to provide the position information for each computational node in the RSP; and uses that to decide the scheduling priorities. Our experiments show that both schedulers provides similar performance while being significantly better than a reference implementation without dynamic load balancing. Also based on the study of RSP performance, we present in this thesis two new heuristic partitioning algorithms which are used to map RSPs onto heterogeneous distributed platforms. These are Kernighan-Lin Adaptation (KLA) and Congestion Avoidance (CA), where the main objective is to optimise the throughput. This is a multi-parameter optimisation problem where existing graph partitioning algorithms are not applicable. Compared to the generic meta-heuristic Simulated Annealing algorithm, both proposed algorithms achieve equally good or better results. KLA is faster for small benchmarks while slower for large ones. In contrast, CA is always orders of magnitudes faster even for very large benchmarks

    Monitoring framework for stream-processing networks

    Get PDF
    Vu Thien Nga Nguyen, Raimund Kirner, and Frank Penczek, 'Monitoring framework for stream-processing networks'. Paper presented at the Workshop on Feedback-Directed Compiler Optimization for Multi-Core Architectures (FD-COMA 2012), Berlin, Germany. 21-23 January 2013.In this paper we present a monitoring framework that exploits special characteristics of stream-processing networks in order to reason the performance. The novelty of the framework is to trace the non-deterministic execution which is reflected in i) the dynamic mapping and scheduling of network components at the operating system level and ii) the dynamic message routing across the network at runtime. We evaluate the efficiency with an implementation for the coordination language S-Net, showing negligible overhead in most cases

    Throughput-driven Partitioning of Stream Programs on Heterogeneous Distributed Systems

    Get PDF
    This is an Open Access article. © 2015 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.Graph partitioning is an important problem in computer science and is of NP-hard complexity. In practice it is usually solved using heuristics. In this article we introduce the use of graph partitioning to partition the workload of stream programs to optimise the throughput on heterogeneous distributed platforms. Existing graph partitioning heuristics are not adequate for this problem domain. In this article we present two new heuristics to capture the problem space of graph partitioning for stream programs to optimise throughput. The first algorithm is an adaptation of the well-known Kernighan-Lin algorithm, called KL-Adapted (KLA), which is relatively slow. As a second algorithm we have developed the Congestion Avoidance (CA) partitioning algorithm, which performs reconfiguration moves optimised to our problem type. We compare both KLA and CA with the generic meta-heuristic Simulated Annealing (SA). All three methods achieve similar throughput results for most cases, but with significant differences in calculation time. For small graphs KLA is faster than SA, but KLA is slower for larger graphs. CA on the other hand is always orders of magnitudes faster than both KLA and SA, even for large graphs. This makes CA potentially useful for re-partitioning of systems during runtime.Peer reviewedFinal Published versio

    A Power-Aware Framework for Executing Streaming Programs on Networks-on-Chip

    Get PDF
    Nilesh Karavadara, Simon Folie, Michael Zolda, Vu Thien Nga Nguyen, Raimund Kirner, 'A Power-Aware Framework for Executing Streaming Programs on Networks-on-Chip'. Paper presented at the Int'l Workshop on Performance, Power and Predictability of Many-Core Embedded Systems (3PMCES'14), Dresden, Germany, 24-28 March 2014.Software developers are discovering that practices which have successfully served single-core platforms for decades do no longer work for multi-cores. Stream processing is a parallel execution model that is well-suited for architectures with multiple computational elements that are connected by a network. We propose a power-aware streaming execution layer for network-on-chip architectures that addresses the energy constraints of embedded devices. Our proof-of-concept implementation targets the Intel SCC processor, which connects 48 cores via a network-on- chip. We motivate our design decisions and describe the status of our implementation

    Statistical Performance Analysis of an Ant-Colony Optimisation Application in S-NET

    Get PDF
    Kenneth MacKenzie, Philip K. F. Hölzenspies, Kevin Hammond, Raimund Kirner, Vu Thien Nga Nguyen, Iraneus te Boekhorst, Clemens Grelck, Raphael Poss, Merijn Verstraaten, 'Statistical Performance Analysis of an Ant-Colony Optimisation Application in S-NET'. Paper presented at the 2nd Workshop on Feedback-Directed Compiler Optimization for Multi-Core Architectures. Berlin, Germany, 12 January 2013.We consider an ant-colony optimsation problem implemented on a multicore system as a collection of asynchronous stream- processing components under the control of the S-NET coordina- tion language. Statistical analysis and visualisation techniques are used to study the behaviour of the application, and this enables us to discover and correct problems in both the application program and the run-time system underlying S-NET

    Dynamic Power Management for Reactive Stream Processing on the SCC Tiled Architecture

    Get PDF
    This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.Dynamic voltage and frequency scaling} (DVFS) is a means to adjust the computing capacity and power consumption of computing systems to the application demands. DVFS is generally useful to provide a compromise between computing demands and power consumption, especially in the areas of resource-constrained computing systems. Many modern processors support some form of DVFS. In this article we focus on the development of an execution framework that provides light-weight DVFS support for reactive stream-processing systems (RSPS). RSPS are a common form of embedded control systems, operating in direct response to inputs from their environment. At the execution framework we focus on support for many-core scheduling for parallel execution of concurrent programs. We provide a DVFS strategy for RSPS that is simple and lightweight, to be used for dynamic adaptation of the power consumption at runtime. The simplicity of the DVFS strategy became possible by sole focus on the application domain of RSPS. The presented DVFS strategy does not require specific assumptions about the message arrival rate or the underlying scheduling method. While DVFS is a very active field, in contrast to most existing research, our approach works also for platforms like many-core processors, where the power settings typically cannot be controlled individually for each computational unit. We also support dynamic scheduling with variable workload. While many research results are provided with simulators, in our approach we present a parallel execution framework with experiments conducted on real hardware, using the SCC many-core processor. The results of our experimental evaluation confirm that our simple DVFS strategy provides potential for significant energy saving on RSPS.Peer reviewe

    Demand-based scheduling priorities for performance optimisation of stream programs on parallel platforms

    No full text
    This paper introduces a heuristic-based scheduler to optimise the throughput and latency of stream programs with dynamic network structure. The novelty is the utilisation of positive and negative demands of the streamcommunications. It is a centralised approach to provide load balancing for stream programs with dynamic network structures. The approach is designed for shared-memory multi-core platforms. The experiments show that our scheduler performs significantly better than the reference implementation without demand considerations

    A simple strategy to enhance the in vivo wound-healing activity of curcumin in the form of self-assembled nanoparticle complex of curcumin and oligochitosan

    No full text
    While the wound healing activity of curcumin (CUR) has been well-established, its clinical effectiveness remains limited due to the inherently low aqueous CUR solubility, resulting in suboptimal CUR exposure in the wound sites. Previously, we developed high-payload amorphous nanoparticle complex (or nanoplex) of CUR and chitosan (CHI) capable of CUR solubility enhancement by drug-polyelectrolyte complexation. The CUR-CHI nanoplex, however, exhibited poor colloidal stability due to its strong agglomeration tendency. Herein we hypothesized that the colloidal stability could be improved by replacing CHI with its oligomers (OCHI) owed to the better charge distribution in OCHI. The effects of key parameters in drug-polyelectrolyte complexation (i.e. pH, salt inclusion, CUR concentration, and OCHI/CUR charge ratio) on the physical characteristics and preparation efficiency of the CUR-OCHI nanoplex produced were investigated. The in vivo wound healing efficacy of the CUR-OCHI nanoplex and its cytotoxicity towards human keratinocytes cells were examined. The results showed that CUR-OCHI nanoplex exhibited prolonged colloidal stability (72 h versus 90% after 7 days versus 9 days for the native CUR resulting in smaller scars, attributed to its generation of high CUR concentration in the wound sites.Nanyang Technological UniversityAccepted versionThe authors would like to acknowledge the research funds from Vietnam Atomic Energy Institute and Nuclear Research Institute (Grant number: 07/17/VNCHN) and from Nanyang Technological University's Undergraduate Research Experience on Campus (URECA) for Suen Ern Lee
    corecore